Hypothesis:
SMG6 plays role in mediating nonsense-mediated mRNA decay (NMD). Therefore, I hypothesize that SMG6 KO cells will show upregulation of other genes involved in NMD such as UPF1 as compensation.
Dataset information:
RNA Extraction: NPCs were isolated from E13.5 embryo brains (cortices CTX and ganglionic eminences GE) after crossing \(Smg6^{flox}\) and \(Rosa26-CreER^{T2}\) mice. Kept in neurosphere cultures for 2 days followed by 4-Hydroxytamoxifen (4-OHT) treatment to induce the Smg6 deletion. Total RNA was isolated 6 days after 4-OHT treatment from cells with genotypes
Limitations: the samples are from in vitro cell cultures, therefore the generalizability of findings towards living organisms is limited. Also, the samples are mouse cells, which means the implication of the result is not directly related to human.
Figure. SMG6 project organization
The GitHub Repository of this project can be found here.
You can find all scripts in the git repository.
Download experiment metadata from NCBI.
Download run accession and ftp links from ENA. PRJNA776901 -> download report -> TSV
Navigate to working directory:
ssh.curie.pbtech
cd project_scripts
sbatch download_fastq.sh
SRR16684379 is shown as unavailable on ENA. Instead, I used
fasterq-dump SRR16684379 to download this sample.
sbatch run_trim_galore.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw"
trim_galore:
--gzip: compress output files.--illumina: specified because the RNA seq uses illumina
platform.sbatch run_fastqc.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw"
sbatch run_fastqc.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_trimmed"
fastqc --extract will uncompress the zipped output
files.
copy the html files to local:
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/fastqc
scp xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw/fastqc/*.html .
mamba activate multiqc
multiqc /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw/fastqc/*_fastqc/
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/
scp xil4009@aphrodite.med.cornell.edu:/home/xil4009/project_scripts/multiqc_report.html .
MultiQC results for trimmed fastq:
sbatch download_m39.sh
# will download m39 assembly and annotation, save in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/genome
sbatch index_STAR_m39.sh
# STAR INDEX saved in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/genome/m39_STARindex
Genome assembly & annotation are download from Ensembl.
Genome assembly: Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
Annotation: Mus_musculus.GRCm39.109.gtf.gz
Run alignment on trimmed fastq files:
sbatch run_star_alignment.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_trimmed"
# alignment results saved in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star
STAR --outSAMtype BAM SortedByCoordinate: outputs sorted
bam files.
sbatch run_alignqc.sh
qorts QC:
--generatePlots: outputs plots.--singleEnded: specified because the RNA seq protocol
is single ended.--stranded: specified because the RNA seq protocol is
stranded.Download qorts QC plots to local:
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project
# this will search the remote path for ".pdf" files
# and then copying them, preserving the directory structure
rsync -arv --prune-empty-dirs \
--include="*/" --include="*.pdf" --exclude="*" \
xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_qc .
The first --include matches the folders, the second the
pdf file, then the --exclude prevents
downloading everything else.
Read distribution:
While the majority of reads are mapped to unique genes, a number of reads are mapped to intronic or UTR regions, which suggests usage of ribodepletion protocol.
Average gene body coverage:
Gene body coverage is biased towards 3’ end.
Strandedness:
Most reads are mapped to the first strand.
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star/SRR16684375.Aligned.sortedByCoord.out.bam .
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star/SRR16684375.Aligned.sortedByCoord.out.bam.bai .
Gene Actb of control SRR16684375
Observations:
The library appears to be stranded, which is consistent with what is stated in the paper. There are a few alignments (blue) to the reverse strand. Some reads are mapped to non-exon regions, indicating ribo-depeletion protocol.
IGV gene coverage
mamba activate multiqc
multiqc /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ
Alignment rates and of 11 samples
featureCounts documentation
flag -s: 0 (unstranded), 1 (stranded) and 2 (reversely
stranded). 0 by default.
parameter -s 2 is used.
sbatch run_feature_counts.sh
Copy featureCounts.txt.summary folder to local:
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/feature_counts .
library(glue)
library(DESeq2)
library(ggplot2); theme_set(theme_bw(base_size = 16)) # for making plots
library(magrittr) # for "pipe"-like coding in R
library(pheatmap)
library(AnnotationDbi)
library(RColorBrewer)
library(org.Mm.eg.db)
mm <- org.Mm.eg.db
library(EnhancedVolcano)
If needed, install pandas:
library(reticulate)
py_install("pandas")
import os
import sys
import pandas as pd
def add_sample_type(row):
if row['Genotype'] == "Smg6 Flox/Flox; RosaCreER tg/+" and row['TREATMENT'] == "no treatment":
return 'Control'
elif row['Genotype'] == "Smg6 Flox/Flox; RosaCreER +/+" and row['TREATMENT'] == "4-OHT":
return 'Control_4OHT'
elif row['Genotype'] == "Smg6 Flox/Flox; RosaCreER tg/+" and row['TREATMENT'] == "4-OHT":
return 'Smg6_iKO'
def sample_type(metadata, output):
md = pd.read_csv(metadata)
md['Samplename'] = md.apply(lambda row: add_sample_type(row), axis=1)
# only keep columns 'Run' and 'Samplename'
md = md[['Run', 'Samplename']]
md.to_csv(output, sep=' ', index=False, header=False)
sample_type("metadata/SraRunTable.txt", "metadata/runid_samplename.txt")
fc <- read.table("feature_counts_strand2/featureCounts.txt", header=TRUE)
metadata <- read.table("metadata/runid_samplename.txt")
# set the first column as row names
row.names(metadata) <- metadata[, 1]
metadata[, 1] <- NULL # remove the first column
# Sample name formatting
names(fc) <- gsub(".*(SRR)([0-9]+).*", "\\1\\2", names(fc))
col_names <- names(fc)
## add conditions to sample IDs
for ( n in 1:length(col_names) ) {
col <- col_names[n]
# if the col is the ID
if (grepl("SRR",col)) {
# get the condition of the ID
condition <- metadata[col, 1]
col_names[n] <- glue("{col}_{condition}")
}
}
names(fc) <- col_names
fc_summary <- read.table("feature_counts/featureCounts.txt.summary", header=TRUE)
names(fc_summary) <- gsub(".*(SRR)([0-9]+).*", "\\1\\2", names(fc_summary))
fc_summary
colData: data.frame with all the variables you know
about your samples, e.g., experimental condition, the type, and date of
sequencing and so on. Its row.names should correspond to the unique
sample names.
countData: should contain a matrix of the actual values
associated with the genes and samples. Conveniently, this is almost
exactly the format of the featureCounts output.
## gene IDs should be stored as row.names
row.names(fc) <- make.names(fc$Geneid)
## exclude the columns without read counts (columns 1 to 6 contain additional ## info such as genomic coordinates)
readcounts <- fc[ , -c(1:6)]
head(readcounts)
sample_info <- data.frame(condition = gsub("SRR[0-9]+_", "", names(readcounts)),
row.names = names(readcounts) )
sample_info
DESeq.ds <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts),
colData = sample_info,
design = ~ condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# How many reads were sequenced for each sample ( = library sizes)
colSums(counts(DESeq.ds)) %>% barplot(las = 2, cex.names = 0.7)
Remove genes with no reads:
dim(DESeq.ds)
## [1] 57010 11
keep_genes <- rowSums(counts(DESeq.ds)) > 0
DESeq.ds <- DESeq.ds[ keep_genes, ]
dim(DESeq.ds)
## [1] 40239 11
DESeq.ds <- estimateSizeFactors(DESeq.ds) # calculate SFs, add them to object
plot( sizeFactors(DESeq.ds), colSums(counts(DESeq.ds)), # assess them
ylab = "library sizes", xlab = "size factors", cex = .6 )
Transform
par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(log2(counts(DESeq.ds) +1), notch=TRUE,
main = "Non-normalized read counts",
ylab ="log2(read counts)", cex = .6, las=2)
## bp of size-factor normalized values
boxplot(log2(counts(DESeq.ds, normalized=TRUE) +1), notch=TRUE,
main = "Size-factor-normalized read counts",
ylab ="log2(read counts)", cex = .6, las=2)
Make a scatterplot of log normalized counts against each other to see how well the actual values correlate which each other per sample and gene.
## non-normalized read counts plus pseudocount
assay(DESeq.ds, "log.counts") <- log2(counts(DESeq.ds, normalized = FALSE) + 1)
## normalized read counts
assay(DESeq.ds, "log.norm.counts") <- log2(counts(DESeq.ds, normalized=TRUE) + 1)
par(mfrow=c(1,2))
DESeq.ds[, c("SRR16684376_Control","SRR16684375_Control")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "76_Control vs. 75_Control")
DESeq.ds[, c("SRR16684379_Control_4OHT","SRR16684378_Control_4OHT")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "79_Control_4OHT vs 78_Control_4OHT")
The fanning out of the points in the lower left corner indicates that read counts correlate less well between replicates when they are low. the lower the mean read counts per gene, the higher the standard deviation.
Mean-SD Plot:
## generate the base meanSdPlot using sequencing depth normalized log2(read counts) ## set up plotting frame
par(mfrow=c(1,1))
## generate the plot
msd_plot <- vsn::meanSdPlot(assay(DESeq.ds, "log.norm.counts"),
ranks=FALSE, # show the data on the original scale
plot = FALSE)
msd_plot$gg +
ggtitle("Sequencing depth normalized log2(read counts)") +
ylab("standard deviation")
meanSdPlot manual: The red line depicts the running
median estimator. If there is no variance-mean dependence, then the line
should be approximately horizontal.
The plot here shows that there is some variance-mean dependence for genes with low read counts. This means that the data shows signs of heteroskedasticity. Many tools expect data to be homoskedastic, i.e., all variables should have similar variances.
rlog transformation:
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
## set blind = FALSE if the conditions are expected to introduce ## strong differences in a large proportion of the genes
## (blind means blind to the experimental design)
par(mfrow=c(1,2))
plot(assay(DESeq.ds, "log.norm.counts")[,1:2], cex=.1,
main = "size factor and log2-transformed")
## the rlog-transformed counts are stored in the "assay" accessor
plot(assay(DESeq.rlog)[,1:2],
cex=.1, main = "rlog transformed",
xlab = colnames(assay(DESeq.rlog[,1])),
ylab = colnames(assay(DESeq.rlog[,2])) )
Mean-SD Plot:
## rlog-transformed read counts
msd_plot <- vsn::meanSdPlot(assay(DESeq.rlog), ranks=FALSE, plot = FALSE)
msd_plot$gg + ggtitle("Following rlog transformation") +
coord_cartesian(ylim = c(0,3))
select <- order(rowMeans(counts(DESeq.ds,normalized=TRUE)),
decreasing=TRUE)[1:20]
cdata <- colData(DESeq.ds)
pheatmap(assay(DESeq.rlog)[select,],
cluster_rows=FALSE,
show_rownames=FALSE,
cluster_cols=FALSE,
annotation_col=as.data.frame(cdata))
sampleDists <- dist(t(assay(DESeq.rlog)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(DESeq.rlog$condition)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
plotPCA(DESeq.rlog, intgroup=c("condition"))
DESeq.ds
## class: DESeqDataSet
## dim: 40239 11
## metadata(1): version
## assays(3): counts log.counts log.norm.counts
## rownames(40239): ENSMUSG00000104385 ENSMUSG00000086053 ...
## ENSMUSG00002074899 ENSMUSG00000095742
## rowData names(0):
## colnames(11): SRR16684376_Control SRR16684375_Control ...
## SRR16684379_Control_4OHT SRR16684378_Control_4OHT
## colData names(2): condition sizeFactor
DESeq.ds$condition
## [1] Control Control Control_4OHT Control Control_4OHT
## [6] Smg6_iKO Smg6_iKO Smg6_iKO Smg6_iKO Control_4OHT
## [11] Control_4OHT
## Levels: Control Control_4OHT Smg6_iKO
## the following uses the magrittr assignment pipe
DESeq.ds$condition %<>% relevel(ref="Control")
DESeq.ds$condition
## [1] Control Control Control_4OHT Control Control_4OHT
## [6] Smg6_iKO Smg6_iKO Smg6_iKO Smg6_iKO Control_4OHT
## [11] Control_4OHT
## Levels: Control Control_4OHT Smg6_iKO
design(DESeq.ds)
## ~condition
DESeq.ds %<>% DESeq()
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# normalize for diffs in sequencing depth and abundance per sample
# gene-wise dispersion estimates across all samples
# fit a neg. binomial GLM and compute Wald stat for each gene
You can see that the DESeq object now has additional
entries n the rowData slot:
DESeq.ds
## class: DESeqDataSet
## dim: 40239 11
## metadata(1): version
## assays(6): counts log.counts ... H cooks
## rownames(40239): ENSMUSG00000104385 ENSMUSG00000086053 ...
## ENSMUSG00002074899 ENSMUSG00000095742
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(11): SRR16684376_Control SRR16684375_Control ...
## SRR16684379_Control_4OHT SRR16684378_Control_4OHT
## colData names(2): condition sizeFactor
rowData(DESeq.ds) %>% colnames
## [1] "baseMean"
## [2] "baseVar"
## [3] "allZero"
## [4] "dispGeneEst"
## [5] "dispGeneIter"
## [6] "dispFit"
## [7] "dispersion"
## [8] "dispIter"
## [9] "dispOutlier"
## [10] "dispMAP"
## [11] "Intercept"
## [12] "condition_Control_4OHT_vs_Control"
## [13] "condition_Smg6_iKO_vs_Control"
## [14] "SE_Intercept"
## [15] "SE_condition_Control_4OHT_vs_Control"
## [16] "SE_condition_Smg6_iKO_vs_Control"
## [17] "WaldStatistic_Intercept"
## [18] "WaldStatistic_condition_Control_4OHT_vs_Control"
## [19] "WaldStatistic_condition_Smg6_iKO_vs_Control"
## [20] "WaldPvalue_Intercept"
## [21] "WaldPvalue_condition_Control_4OHT_vs_Control"
## [22] "WaldPvalue_condition_Smg6_iKO_vs_Control"
## [23] "betaConv"
## [24] "betaIter"
## [25] "deviance"
## [26] "maxCooks"
save(DESeq.ds, file="DESeq.ds.Rdata")
save(DESeq.rlog, file="DESeq.rlog.Rdata")
rowData(DESeq.ds)$WaldPvalue_condition_Smg6_iKO_vs_Control %>%
hist(breaks=19, main="Raw p-values for Smg6-iKO vs Control")
rowData(DESeq.ds)$WaldPvalue_condition_Control_4OHT_vs_Control %>%
hist(breaks=19, main="Raw p-values for Control-4OHT vs Control")
By default, will compare Smg6_iko vs control
DGE.results <- results(DESeq.ds, independentFiltering = TRUE, alpha = 0.05)
# the first line will tell you which comparison was done to achieve the log2FC
res.ctl.smg6 <- results(DESeq.ds, contrast=c("condition","Smg6_iKO", "Control"), independentFiltering = TRUE, alpha = 0.05)
res.ctl.ct4oht <- results(DESeq.ds, contrast=c("condition","Control_4OHT", "Control"))
res.ctl4oht.smg6 <- results(DESeq.ds, contrast=c("condition","Smg6_iKO", "Control_4OHT"))
# the first line will tell you which comparison was done to achieve the log2FC
head(res.ctl.smg6)
## log2 fold change (MLE): condition Smg6_iKO vs Control
## Wald test p-value: condition Smg6 iKO vs Control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000104385 0.9620537 0.533272 1.648386 0.323511 0.746308
## ENSMUSG00000086053 0.0831445 -1.039083 3.972769 -0.261551 0.793667
## ENSMUSG00000102135 7.1935060 -0.299940 0.653222 -0.459169 0.646112
## ENSMUSG00000101097 0.0977052 0.000000 3.972769 0.000000 1.000000
## ENSMUSG00000100764 5.1448527 0.883793 0.705590 1.252559 0.210366
## ENSMUSG00000102534 0.4503458 -1.256679 2.241398 -0.560668 0.575024
## padj
## <numeric>
## ENSMUSG00000104385 NA
## ENSMUSG00000086053 NA
## ENSMUSG00000102135 NA
## ENSMUSG00000101097 NA
## ENSMUSG00000100764 NA
## ENSMUSG00000102534 NA
summary(res.ctl.smg6)
##
## out of 40239 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 568, 1.4%
## LFC < 0 (down) : 1025, 2.5%
## outliers [1] : 5, 0.012%
## low counts [2] : 25743, 64%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# the DESeqResult object can basically be handled like a data.frame
table(res.ctl.smg6$padj < 0.05)
##
## FALSE TRUE
## 12898 1593
AnnotationDbi::select(mm, keys="Smg6",keytype="SYMBOL", columns="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
plotCounts(DESeq.ds, gene="ENSMUSG00000038290", normalized=TRUE, xlab="", main="Smg6")
res.ctl.smg6.sorted <- res.ctl.smg6 %>% `[`(order(.$padj),)
res.ctl4oht.smg6.sorted <- res.ctl4oht.smg6 %>% `[`(order(.$padj),)
res.ctl.ct4oht.sorted <- res.ctl.ct4oht %>% `[`(order(.$padj),)
# save these three objects
save(res.ctl.smg6.sorted, res.ctl4oht.smg6.sorted, res.ctl.ct4oht.sorted, file="DEG.results.sorted.Rdata")
write.table(res.ctl.smg6.sorted, "DEG_results_Control_vs_SMG6iKO.csv", sep=",", quote=FALSE, row.names=TRUE)
write.table(res.ctl4oht.smg6.sorted, "DEG_results_Control4OHT_vs_SMG6iKO.csv", sep=",", quote=FALSE, row.names=TRUE)
write.table(res.ctl.ct4oht.sorted, "DEG_results_Control_vs_Control4OHT.csv", sep=",", quote=FALSE, row.names=TRUE)
DEG.res.sorted = res.ctl.smg6.sorted
DEG.res.sorted$padj %>%
hist(breaks=19, main="Adjusted p-values for Smg6 iKO vs Control")
# identify genes with the desired adjusted p-value cut-off
DGEgenes <- rownames(subset(DEG.res.sorted, padj < 0.05)) # extract rlog-transformed values into a matrix
rlog.dge <- DESeq.rlog[DGEgenes,] %>% assay
# heatmap of DEG sorted by p.adjust
pheatmap(rlog.dge,
scale="row",
show_rownames=FALSE,
main="DGE Control-4OHT vs SMG-iKO (row-based z-score)")
plotMA(DEG.res.sorted, alpha=0.05,
main="Test: p.adj.value < 0.05", ylim = c(-4,4))
Annotate with gene names:
annot.DGE <- AnnotationDbi::select(mm, keys=rownames(DEG.res.sorted),
keytype="ENSEMBL", columns="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
DEG.res.sorted$ENSEMBL <- rownames(DEG.res.sorted)
DEG.res.sorted.df <- as.data.frame(DEG.res.sorted)
DEG.res.sorted.df <- merge(x = DEG.res.sorted.df, y = annot.DGE, by= "ENSEMBL", all.x=TRUE)
head(DEG.res.sorted.df)
Volcano plot:
vp1 <- EnhancedVolcano(DEG.res.sorted.df,
lab=DEG.res.sorted.df$SYMBOL,
x='log2FoldChange', y='padj',
pCutoff=0.05,
title="Control-4OHT VS. SMG6-iKO")
print(vp1)
DEG.res.sorted.df.sig <- subset(DEG.res.sorted.df, (abs(DEG.res.sorted.df$log2FoldChange) > 0.2 & DEG.res.sorted.df$padj < 0.05)) %>% `[`(order(.$padj),)
head(DEG.res.sorted.df.sig)
library(enrichR)
dbs <- listEnrichrDbs()
library(clusterProfiler)
library(cowplot)
gmt file is downlaoded from GSEA mouse gene set NMD
gmt_df <- read.gmt("GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY.v2023.1.Mm.gmt")
## Warning in readLines(gmtfile): incomplete final line found on
## 'GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY.v2023.1.Mm.gmt'
annot.gmt <- AnnotationDbi::select(mm, keys=gmt_df$gene, keytype="SYMBOL", columns="ENSEMBL", multiVals="first")
## 'select()' returned 1:1 mapping between keys and columns
head(annot.gmt)
annot.gmt <- annot.gmt[!duplicated(annot.gmt$SYMBOL), ]
gmt_df$gene <- annot.gmt$ENSEMBL
# rank genes by log2FoldChange
gene_list = DEG.res.sorted.df$log2FoldChange
names(gene_list) <- DEG.res.sorted.df$ENSEMBL
gene_list<- sort(gene_list, decreasing = TRUE)
gse.nmdset <- GSEA(gene_list, TERM2GENE = gmt_df, pvalueCutoff = 1)
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (19.15% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.
## leading edge analysis...
## done...
gse.nmdset # check the result
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism UNKNOWN
## #...@setType UNKNOWN
## #...@geneList Named num [1:40346] 4.47 4.44 4.18 4.12 3.96 ...
## - attr(*, "names")= chr [1:40346] "ENSMUSG00000029371" "ENSMUSG00000082534" "ENSMUSG00000112424" "ENSMUSG00000040969" ...
## #...nPerm
## #...pvalues adjusted by 'BH' with cutoff <1
## #...1 enriched terms found
## 'data.frame': 1 obs. of 11 variables:
## $ ID : chr "GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY"
## $ Description : chr "GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY"
## $ setSize : int 42
## $ enrichmentScore: num -0.253
## $ NES : num -0.812
## $ pvalue : num 0.823
## $ p.adjust : num 0.823
## $ qvalue : logi NA
## $ rank : num 30147
## $ leading_edge : chr "tags=100%, list=75%, signal=25%"
## $ core_enrichment: chr "ENSMUSG00000001415/ENSMUSG00000094973/ENSMUSG00000073460/ENSMUSG00000040613/ENSMUSG00000002210/ENSMUSG000000583"| __truncated__
## #...Citation
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
## The Innovation. 2021, 2(3):100141
gseaplot(gse.nmdset, geneSetID = 1, by = "runningScore", title="NMD pathway")
sig_DEGs <- subset(DEG.res.sorted.df.sig, padj < 0.05, abs(log2FoldChange)>0.2)
sig_upReg_DEGs <- subset(sig_DEGs, log2FoldChange > 0)
sig_downReg_DEGs <- subset(sig_DEGs, log2FoldChange < 0)
sig_DEGs <- enrichr(as.vector(na.omit(sig_upReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
## Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.
upReg_pathways <- enrichr(as.vector(na.omit(sig_upReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
## Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.
downReg_pathways <- enrichr(as.vector(na.omit(sig_downReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
## Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.
sigDEG_plot <- plotEnrich(sig_DEGs[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value", title = "enriched pathways, padj < 0.05, logFoldChange>0.2")
sigDEG_plot
upReg_plot <- plotEnrich(upReg_pathways[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value", title = "Upregulated pathways")
downReg_plot <- plotEnrich(downReg_pathways[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value")
combined_plot <- plot_grid(upReg_plot, downReg_plot, labels = c('A', 'B'))
upReg_plot
downReg_plot
ggsave(
"Pathway_enrichment.png",
combined_plot ,
width = 12,
height = 5,
units = "in",
dpi = 300
)
Authors’ feature counts for SMG6:
cd GSE186964_RAW/
grep "ENSMUSG00000038290" *.txt | cut -f 7
2003
2483
1942
1787
1879
2238
1885
2144
1698
2000
2133
Download bam files:
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/bam_files
bam="/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/SRR16684385.Aligned.sortedByCoord.out.bam"
bai="/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/SRR16684385.Aligned.sortedByCoord.out.bam.bai"
scp -r xil4009@aphrodite.med.cornell.edu:${bam or bai} .
scp -r xil4009@aphrodite.med.cornell.edu:${bai} .
scp -r xil4009@aphrodite.med.cornell.edu:/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/*.{bam,bai} .